Working With Data

October 27, 2023

What are Tidy Data?


  • Each column represents a single variable
  • Each row represents a single observation
  • Each cell represents a single value

Tidy Data Example

What are Clean Data?

  • Column names are easy to work with and are not duplicated
  • Missing values have been dealt with
  • There are no repeated observations or columns
  • There are no blank observations or columns
  • The data are in the proper format, for example dates should be formatted as dates

Messy Data Example

Step 1: Downloaded Some Messy Data


  • Created a folder called “data” in your course modules project folder
  • Downloaded some messy WB data

Step 2: Read the Data Into R


# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/your_file_name.csv")

# View the data
glimpse(wb_data_messy)

Viewed with View("wb_data_messy") or by clicking on object…

Step 3: Wrote a New CSV File


write_csv(wb_data_messy, "data/your_new_file_name.csv")

With Excel

  • Can do the same thing with Excel files
  • Have a look at the readr cheatsheet
  • Follow same steps as with CSV file
    • use read_excel() to read in the data
    • install and experiment with writexl

Messy Data Example

Pivot Longer


pivot_longer() takes three arguments:

  • cols - which columns you want to pivot
  • names_to - the name of the column where the old column names are going to (identifier)
  • values_to - the name of the column where the values are going to

Can you remember how to make pivot_longer() work?


# Load tidyr
library(tidyr)

# Reshape the data
      <-    |>
  pivot_longer(
    cols = 
    names_to = 
    values_to = 
  )

glimpse()


# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> 
  pivot_longer(             
    cols = `1973 [YR1973]`: `2022 [YR2022]`, # columns to pivot
    names_to = "year", # name the identifier column "year"
    values_to = "flfp" # name the numeric var column "flfp"
  ) 

# View the data
glimpse(wb_data)

Mutating Variables


  • Anytime we want to change a variable, we are going to use the dplyr verbs mutate() or mutate_at()
  • mutate() is if you want to change on variable
  • mutate_at() is for multiple variables

Fixing Our Variables


# Fix year and flfp
wb_data <- wb_data |> # replace wb_data with a modified version of the dataframe 
  mutate(year = substring(year, 1, 4)) |> # truncate year (keep first four characters)
  mutate_at(c("year", "flfp"), as.numeric) # change year and flfp to numeric

# View the data
glimpse(wb_data)

Clean Variable Names

Very simple: use the janitor package!


# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)

APIs

  • API stands for “Application Programming Interface”
  • Way for two computers to talk to each other
  • In our case, we will use APIs to download social science data
  • APIs are accessed through packages in R
  • Sometimes there can be more than one package for an API
  • Usually easier than reading in data from messy flat file

Datasets

  • World Bank data through wbstats
    • There is another package called WDI
  • Varieties of Democracy (V-Dem) through vdemdata
    • This is the only package for V-Dem
    • No arguments
    • Just downloads all the data!

Searching for WB Indicators


flfp_indicators <- wb_search("female labor force") # store the list of indicators

print(flfp_indicators, n=26) # view the indicators

Try it Yourself

  • Start a new Quarto document
  • Start a new code chunk
  • Load the wbstats package
  • Use wb_search() to find some indicators you are interested in
  • If search is not successful, visit WDI site
05:00

wbstats Example

# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
women_emp <- wb_data(indicators, mrv = 50) |> # download data for last 50 yrs
  select(!iso2c) |> # drop the iso2c code which we won't be using
  rename(year = date) |> # rename date to year 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(women_emp) 

Try it Yourself


  • Using the variables you identified in your search, download the data
  • Think about what elements of the code you need to change
  • Hint: the indicators are going to change!
05:00

Downloading V-Dem Data

  • The vdem function from vdemdata just downloads the data
  • There are no parameters for selecting indicators or date ranges
  • So we have to do more work with dplyr functions
    • filter() for years
    • select() for variables
  • We are also going to use case_match() to add

vdemdata Example

# Load packages
library(vdemdata) # to download V-Dem data

# Download the data
democracy <- vdem |> # download the V-Dem dataset
  filter(year >= 1990)  |> # filter out years less than 1990
  select(                  # select (and rename) these variables
    country = country_name,     # the name before the = sign is the new name  
    vdem_ctry_id = country_id,  # the name after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, # replace the values in region with country names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
                    # number on the left of the ~ is the V-Dem region code
                    # we are changing the number to the country name on the right
                    # of the equals sign
  )

# View the data
glimpse(democracy)

Try it Yourself

  • Go to the V-Dem Codebook
  • Select a democracy indicator from Part II and one other indicator
  • Note the indicator code (e.g. “v2x_polyarchy” for the polyarchy score)
  • In your Quarto notes document, create a new code chunk
  • Use the codes for your selected indicators to download the data
05:00